During this short tutorial we will go through a sample workflow with the CAGEr R package (Haberle et al., 2015).
We will only go through all the standard functions and analyses of CAGEr in this practical.
CAGEr Analysis
Here, I have downloaded four ctss.bed.gz files from the F6 server and converted these already to CTSS files for CAGEr. CAGEr accepts bed files but these need to contain a single line per tag so the same position will be in the file multiple times. This is not the format of F6 or F5, which is why we need to convert these files. The CTSS file is a tab separated file with no header and comprises four columns:
Do not run, this is the code I used for your information. The 0-based ctss.bed files are easily converted to 1-based CTSS files. Note that we are using the “end” position (column 3) now for the CTSS. The fifth column being the number of tags at that position.
Linux:
for f in `ls ../data/basic/ctssfiles/CN*`
do
zcat $f | awk -v OFS='\t' '{print $1,$3,$6,$5}' > ../data/basic/ctsstables/table_$f.txt
done
We need to create a CAGEset object first. We will need to set the genome name, the paths to the input files (CTSS files in this case), what file type we’re importing, and the sample labels. The code here will need the files to be in this directory to work: ../data/ctss_tables. However, you can easily change that to where you have placed them.
### load the CAGEr package
library(CAGEr)
### BSgenome with the right version
library(BSgenome.Hsapiens.UCSC.hg38)
### define where the ctss.bed files provided are located for CAGEr
# where the files can be found
pathsToInputFiles <- list.files("../data/ctss_tables", full.names = TRUE)
### creating a CAGEset object
myCAGEset <- new("CAGEset", genomeName = "BSgenome.Hsapiens.UCSC.hg19", inputFiles = pathsToInputFiles, inputFilesType = "ctss", sampleLabels = paste("skin_",1:4, sep = ""))
# you can check the object:
myCAGEset
So this is now a still empty object and you can check what it contains at the moment:
Now we are ready upload the CTSS files and get the CTSSs. We access this slot with the function CTSStagCount(myCAGEset). Type it in and see what class it returns, and see what it looks like.
### reading in the data
getCTSS(myCAGEset)
# get a dataframe of ctss counts:
ctss <- CTSStagCount(myCAGEset)
head(ctss)
At this stage we only have Cage TSS (CTSS) information. The slot of Normalized tpm is still empty (as is the rest of the object). Let’s look at the raw data more closely!
What is the correlation of CTSS between all the samples? One easy way to find out with plotCorrelation(). If we assign the function we can save the matrix table and plot your own style of plots here. The plot will be in the working directory. Check out the matrix!
### creating a correlation plot and table of the samples
corr.m <- plotCorrelation(myCAGEset, samples = "all", method = "pearson")
To make samples comparable, we will need to normalize our data. CAGEr offers both a tags per million normalization and a power-law based normalization. Because many CAGE-seq data follow a power-law distrubtion (Balwierz et al., 2009), we’ll use the power-law based normalization here.
First we need to plot the reverse cumulatives with plotReverseCumulatives() if they actually follow this distribution. On the log-log scale, this reverse cumulative distribution will manifest as a monotonically decreasing linear function of which the slope (alpha) is determined per sample.
From this we can see the alpha for each sample. Select the alpha for the normalization step, the normalizeTagCount() function when method = “powerlaw”. What is the average alpha of the two samples? Is that the “Ref distribution” in the plot?
# check library sizes
librarySizes(myCAGEset)
# normalisation first plot to assess alpha
plotReverseCumulatives(myCAGEset, fitInRange = c(5, 1000), onePlot = TRUE)
normalizeTagCount(myCAGEset, method = "powerLaw", fitInRange = c(5, 1000), alpha = 1.20, T = 5*10^5)
We can plot this again to observe the change; now using the normalised tpm slot with values = "normalized".
Let’s check our object again, what is added and in which slot?
! We have used a different “T” than you would if the complete genome was there: T = 1*10^6
CTSS’ in close proximity of each other give rise to functionally similar set of transcripts within the same promoter elements. Thus, they are basically larger transcriptional units that correspond to individual promoters.
To capture these, tag clusters (TC) of CTSS’ can be easily defined within CAGEr with the function: clusterCTSS(). Run the code below to see the information in the viewer
?clusterCTSS
We will use a simple distance measurement (method = “distclu”). This means that the true distance in bp between individual CTSS is used. Let’s set this at 20 bp (maxDist = 20), so that individual CTSS can not be more than 20 bp apart to form the TC.
We also set the threshold at 1, which means that each CTSS should have 2 or more tag counts in all the samples prior to clustering so as to exclude low fidelity CTSSs.
Finally, no TC are to be called comprised of a single CTSS if the normalized signal is below 5:
clusterCTSS(object = myCAGEset,
threshold = 1,
thresholdIsTpm = TRUE,
nrPassThreshold = 1,
method = "distclu",
maxDist = 20,
removeSingletons = TRUE,
keepSingletonsAbove = 5)
Keep in mind that these are determined per sample
Have a look again at the myCAGEset!
Another feature we can determine with CAGEr is the promoter width. That is in this case, the width of each tag cluster per sample. To have a width more robust to low tag count outliers, CAGEr can determine the width based on a set of quantiles of the cumulative distribution of tag signal per TC.
Generally, the width of a TC (promoter) is set between the quantiles 0.1 and 0.9 to capture 80% of CTSS within the TC. To determine the width we follow these two steps:
cumulativeCTSSdistribution())quantilePositions())These are the most time consuming steps within CAGEr
We can easily plot a histogram of the interquantile widths per sample with CAGEr too
cumulativeCTSSdistribution(myCAGEset, clusters = "tagClusters")
quantilePositions(myCAGEset, clusters = "tagClusters", qLow = 0.1, qUp = 0.9)
plotInterquantileWidth(myCAGEset,
clusters = "tagClusters",
tpmThreshold = 3,
qLow = 0.1,
qUp = 0.9)
Until now, we have determined the TC and the width of those per individual sample. However, TCs are often sample-specific. This could mean that they are present in one sample but not in the other. More so, TCs do not coincide perfectly within the same region but overlap each other as well as the possibility of more than one TC in one sample and a less amount of TC in the other but larger in width (we’ll visualize an example of this later today).
When we want to compare transcriptional activity across samples, regions that encompass the TCs across the samples are needed. These are called consensus clusters that are aggregates of TCs from all samples. The function in CAGEr: aggregateTagClusters().
These we will define and add these to our myCAGEset object. We specify that the maximum distance is a 100 bp between the TCs (with the coordinates of quantile 0.1 and quantile 0.9) across the samples.
aggregateTagClusters(myCAGEset,
tpmThreshold = 5,
qLow = 0.1,
qUp = 0.9,
maxDist = 100)
CAGE signals are essentially measuring transcription starting from single CTSS and thus the transcription levels can be assessed. This can be done per CTSSs, TCs, or consensus clusters.
CAGEr offers two methods to cluster gene expression to identify gene expression patterns:
Both require the number of expression clusters to be known a priori.
Here we will perform the SOM algorithm at consensus cluster level for our four samples to identify expression patterns. We set the threshold to > 10 tpm (normalized) in at least one sample to make sure we have robustly expressed TCs within the consensus clusters.
The function is getExpressionProfiles() and in here we define that we expect in this case 4 clusters (xDim = 2, yDim = 2).
getExpressionProfiles(myCAGEset,
what = "consensusClusters",
tpmThreshold = 10,
nrPassThreshold = 1,
method = "som",
xDim = 2,
yDim = 2)
# plot it too
plotExpressionProfiles(myCAGEset, what = "consensusClusters")
What happens if you change the tpmThreshold? Or change the xDim and yDim?
Subsequently, a character vector of the expression classes can be extracted from the cageset. This will be the same order as the consensus clusters.
expr.clus = expressionClasses(myCAGEset, what = "consensusClusters")
Promoter shifting is described by Haberle et al, 2014. They’ve shown that the same promoter can be used differently in samples at different times in development. The method from this paper has been implemented in CAGEr. Shifting can be detected between two individual samples. If there are more than two samples, they can be merged per group (if possible) and then two groups are then compared.
The shifting uses, similary to determining promoter width, the cumulative distribution per sample of CAGE signal (here: within the consensus clusters). The difference between the cumulative distributions is calculated as a shifting score (see figure below). Additionally, a Kolmogorov-Smirnov test on the cumulative distributions is performed to give a general assessment of differential TSS usage.